Modern Mobile Rendering @ HypeHype

• Research
  o Understand your target hardware (and audience)
  o Why can't we have Nanite and Media Molecule: Dreams on a phone?

• Design
  o What is the correct platform abstraction level?
  o The iterative API design process
  o Do things at the right frequency and granularity

• Implementation
  o Fast & safe object lifetime tracking
  o Fast & clean C++20 API for constructing resources
  o Efficient GPU memory allocation
  o Bind groups, exposed to user land
  o A software command buffer, but an order of magnitude faster
Scope of today's presentation

[Figure: engine layer stack, top to bottom]
• High level rendering code: render pipeline, visual algorithms, shader optimizations, scene data, culling, multithreading, device scalability, material composer, decaling, virtual texturing, content creation
• Graphics APIs and platforms: Metal, Vulkan, WebGPU, WebGL2, Windows, Android, iOS, Mac

Today's scope = bottom levels. Pretty pixels next time!


Research 


Understanding your target audience

• Gather analytics data
  o GPU manufacturer, model, driver version
  o Amount of RAM, OS version
• Compare analytics data to older data
  o Which devices are soon leaving the market?
  o Extrapolate one year in the future = project ETA
  o Keep tracking the data to adjust plans during production
• What is the correct "min spec" hardware?
  o Android: 95% of our users have Vulkan 1.0 + Android 9
  o 2GB memory (1.4GB usable)
  o Have to cut bottom 5% of users
    ▪ New tech improves bottom 50% experience a lot
    ▪ Better user retention for 95% of users

[Table: HypeHype top 20 Androids. Only part of the list is legible, in order of appearance: vivo 1906; OPPO CPH2083; realme RMX3231 (rank 3); OPPO CPH1909; samsung SM-A125F (rank 5); samsung SM-T225 (rank 6); OPPO CPH2269 (rank 7); samsung SM-T295 (rank 8); samsung SM-A115F (rank 9); samsung SM-A035F; samsung SM-A127F; realme RMX3261; realme RMX3195]


Understanding your target hardware

• Form contacts with mobile hardware vendors
  o ARM (Mali), Qualcomm (Adreno), PowerVR, Apple
  o Present your early design. Get feedback. Ask questions
  o Questions → Ask your IHV contacts
• Trick: Read the new hardware marketing material
  o Big improvements → That's still SLOW on 50%+ devices!
• Buy test devices
  o Min spec device of every GPU vendor
• Prototyping
  o Write a small test app to measure the most important gfx API features on each vendor min spec device
  o Confirm that the driver works for our use cases
• Read the best practice guides and API docs

Android:
  o OS: Android 9
  o CPU: 32 bit + 64 bit
  o ARM: Mali-G series (Bifrost)
  o Qualcomm: Adreno 500 series
  o PowerVR: 8000 series (Rogue)

Apple:
  o OS: iOS 13
  o CPU: 64 bit
  o iPhone 6s (A9) / iPad Air 2 (A8X)
  o 7 year old hardware


Why can't we have Nanite or MM: Dreams on a phone?

• GPU-driven rendering: 8 years ago at SIGGRAPH 2015 [1]
• SDF ray-tracing (Claybook): 5 years ago at GDC 2018 [2]
• Nintendo Switch (handheld) versus bottom 50%+ mobile phones [3]
  o Peak flops (~200 GFLOP/s) and mem bandwidth (~20 GB/s) are in the same ballpark
  o Nvidia GPU architecture is designed for compute (CUDA, AZDO):
    ▪ Fast generic memory load/store → Mobile: 16KB uniform buffers! SSBOs are slow!
    ▪ Fast & big groupshared memory → Mobile: Small or emulated
    ▪ Fast local/global atomics and wave intrinsics → Mobile: Wave intrinsic support <10%
    ▪ Big register files and big generic caches → Mobile: Avoid complex shaders
    ▪ 64 bit atomics → Mobile: No 64 bit integers at all!
  o Modern PC graphics: 3D tiling layout for volume textures. Big deal for SDF rendering
• 50%+ of mobile phones: Designed to run existing GLES 3.0 games efficiently


Design 


What is the correct platform abstraction level?

[Figure: five layer stacks compared side by side: game engines, Flutter app framework, mobile apps, OLD HypeHype and NEW HypeHype. Each stack runs from business logic (on a cloud server in one column) and, in most columns, a data model at the top, through high level rendering, shaders and low level rendering, down to GFX API calls at the bottom. The stacks differ in which layers are platform independent and which belong to the application. The NEW HypeHype column is labeled "low & mid level rendering" directly above the GFX API calls.]


Our solution: Minimal platform abstraction

• Thin low level gfx API wrapper
  o Cross reference Vulkan, Metal and WebGPU docs
  o Find the common set of features and differences
  o Design performance optimal way to abstract the differences
  o Metal 2.0: Placement heaps, argument buffers, fences
• Trim deprecated stuff
  o Transform feedback
  o Strips, fans
  o Geometry shaders, HW tessellation
  o Vertex buffers?
    ▪ Some mobile devices still benefit
• Single set of shaders
  o GLSL and use SPIRV-Cross to cross compile [4]


Low level API design goals

• Avoid higher level concepts creeping into low level API
  o No mesh or material: Can be represented as VBs + IB and bind groups
  o No automatic data setup or forced data layout
  o No fixed draw algorithm: Traditional, instancing, etc. Future = GPU-driven?
  o No data loading from disk
• "Zero" extra API overhead
  o Design core pillar: As easy to use as DX11, but as fast as hand optimized DX12
  o Wrong solution: Implement DX11 driver in your code base
  o Potential performance pitfalls:
    ▪ Fine grained inputs, render state and data copies
    ▪ Resource state tracking, shadow state
    ▪ PSO + render state and bind group caching (hash tables)
    ▪ Software command buffers


What is the correct process for API design?

Traditional
• Big technical design document
• Scheduled & split into tasks
• Design first, then code
Issues
• Plans locked too early
• Programmer notices architectural issues too late
• Refactor impacts production

Agile + Test Driven
• What do we need in the next sprints?
• Implement small pieces of tested, production ready modular code
Issues
• Can't see the forest for the trees
• Good pieces != good architecture
• Hard to throw away production ready code with 100% tests


Our solution: Iterative API design process

• Write mock user land code
  o Create resources: textures, shaders, buffers, passes, etc
  o Setup resources with valid data
  o Render a full frame using the resources (+animate)
• Write mock gfx platform API
  o Don't write any backend implementation code yet
  o Compile it with the mock user land code to syntax check
  o Program doesn't yet link or run. It's 100% fine!
• Iterate until happy
  o Add mock use cases whenever needed to improve the coverage
  o Do big architecture refactorings immediately when issues surface
  o Do we cleanly implement all gfx APIs? Abstract differences optimally?
  o Is the performance good? No allocs, copies, map lookups, etc...
• Finally: Implement the platform backends
  o Refactor ASAP if issues are found!
  o Vulkan/Metal API validation layers == big pre-existing test suites

[Figure: process diagram with the steps "Write mock user land code", "Write mock graphics API", "Validate performance", "Refactor!" and "Implement backends"]


Do things at the right frequency and granularity

• Temporal coherency
  o We are rendering the same game world 60 times per second
  o The camera moves smoothly (most of the time)
  o ~90% of the data is unchanged from the previous frame
• Operation frequencies
  o Once: Load a game world (+ all baked data)
  o Low: Load a mesh, texture, material or shader
  o (the remaining entries of this list are illegible in the source)
• Do all of these inside the draw loop?


Our solution: Separate lower frequency ops from drawing

• PSOs
  o Build all pipelines (all render state combinations) at application startup
  o Store the PSO handle in each object's visual component
• Bind groups
  o Create a bind group per material at level load: Contains all texture and buffer bindings
  o Store the material bind group handle in each object's visual component
  o Changing the material = a single Vulkan/Metal command
• Data upload
  o Persistent data: Upload once at startup. Delta update when data changes. [5]
  o Dynamic data:
    ▪ Batch upload whole pass: No per-draw map & unmap
    ▪ Separate by frequency: Per pass | per draw
• Resource synchronization
  o Render pass: RT texture transitioned to write and then read
  o No state tracking per draw call


Implementation 


Fast & safe object lifetime tracking

• Modern practices: Smart pointers, ref counting and RAII?
  o Too slow: Memory allocation per object scatters data around memory and causes cache misses. Copying a pointer = 2x atomics
  o Safety issues: The ref count can hit zero while iterating an array, triggering a destructor (RAII side effect), maybe on another thread. Using a mutex kills performance
• Our solution: Arrays!
  o One big allocation for all objects of the same type
  o Array index is a nice data handle
    ▪ POD. Trivial to copy and pass around
  o Safe to pass to worker threads
    ▪ Can't dereference an array index. Needs access to the array
  o PROBLEM: Old handles referring to an array slot that has been reused?


Pools and handles

• Pool
  o Typed array of objects
    ▪ Every array slot has a generation counter
    ▪ Counter is increased when the slot is freed
  o Freelist for slot reuse
    ▪ An array (stack) of unused pool indices
    ▪ Delete object = push index
    ▪ Create object = pop index. Resize if needed (no ptrs → safe)
• Handle
  o POD struct: Array index + generation counter (32/64 bits)
  o pool.get<T>(handle): Compare generations. No match → return null
  o Typed Handle<T>. Pool has the same handle type. T is forward declared
• Weak reference semantics
  o Null check (predictable branch) is almost free on modern CPUs
  o Much better than callbacks in multithreaded systems. No races / mutexes!
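
The pool and handle scheme described above could look roughly like the following. This is a minimal sketch, not the HypeHype implementation: the names `Handle`, `Pool`, `create`, `destroy` and `get` are illustrative, generations are 32 bit and never wrap, and `T` is assumed to be default constructible and copyable.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// POD handle: array index + generation counter (hypothetical layout).
template <typename T>
struct Handle {
    uint32_t index = 0;
    uint32_t generation = 0; // 0 is never issued, so a default handle is null
};

template <typename T>
class Pool {
public:
    Handle<T> create(const T& value) {
        uint32_t index;
        if (!m_freeList.empty()) {
            index = m_freeList.back(); // create = pop index
            m_freeList.pop_back();
        } else {
            index = static_cast<uint32_t>(m_data.size());
            m_data.emplace_back();      // growing is safe: nobody holds pointers
            m_generations.push_back(1);
        }
        m_data[index] = value;
        return Handle<T>{index, m_generations[index]};
    }

    void destroy(Handle<T> handle) {
        if (get(handle) == nullptr) return; // stale or null handle: ignore
        ++m_generations[handle.index];       // invalidates every old handle
        m_freeList.push_back(handle.index);  // delete = push index
    }

    // Weak reference semantics: returns nullptr when generations differ.
    T* get(Handle<T> handle) {
        if (handle.index >= m_data.size()) return nullptr;
        if (m_generations[handle.index] != handle.generation) return nullptr;
        return &m_data[handle.index];
    }

private:
    std::vector<T> m_data;
    std::vector<uint32_t> m_generations;
    std::vector<uint32_t> m_freeList;
};
```

A handle to a destroyed object keeps resolving to null even after its slot is reused, because the reused slot carries a newer generation.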


[Figure: a handle (index + generation) is compared against the generation stored in pool slot 0, 1, 2, ...: match!]
Hot VS cold data

• Easy to use API needs auxiliary data
  o Texture can't be just a VkImage or MTL::Texture
  o Additional data: size, format, data ptr, allocator...
  o Needed for low frequency tasks:
    ▪ Update, readback, sync, create dependent resources, free memory
• Rendering needs only the hot data
  o Auxiliary data bloats the struct → hurts caches in perf critical draw loop
  o Hate compromising performance and usability :(
• Our solution: Split hot and cold data inside the pool
  o Pool has two types and two arrays: Hot and cold (SoA layout [6])
  o Both can be accessed with the same handle (using the same array index)
  o Split hot and cold data (investigate). Compromise avoided!
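
The hot/cold split could be sketched as below, assuming a pool that keeps two parallel arrays indexed by the same handle. `PoolHandle`, `HotColdPool`, `TextureHot` and `TextureCold` are hypothetical names, the backend object is a placeholder integer, and slot reuse is left out to keep the sketch short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct PoolHandle {
    uint32_t index = 0;
    uint32_t generation = 0; // 0 is never issued, so a default handle is null
};

// Two arrays (SoA): the draw loop only ever touches the hot array.
template <typename Hot, typename Cold>
class HotColdPool {
public:
    PoolHandle create(const Hot& hot, const Cold& cold) {
        m_hot.push_back(hot);
        m_cold.push_back(cold);
        m_generations.push_back(1);
        return PoolHandle{static_cast<uint32_t>(m_hot.size() - 1), 1};
    }
    Hot* getHot(PoolHandle h) { return valid(h) ? &m_hot[h.index] : nullptr; }
    Cold* getCold(PoolHandle h) { return valid(h) ? &m_cold[h.index] : nullptr; }

private:
    bool valid(PoolHandle h) const {
        return h.index < m_generations.size() &&
               m_generations[h.index] == h.generation;
    }
    std::vector<Hot> m_hot;
    std::vector<Cold> m_cold;
    std::vector<uint32_t> m_generations;
};

// Hot: what a draw call needs (placeholder for the backend texture object).
struct TextureHot { uint64_t backendObject = 0; };
// Cold: auxiliary data for low frequency tasks (update, readback, free...).
struct TextureCold { uint32_t width = 0; uint32_t height = 0; uint32_t format = 0; };
```

Because the hot array holds only small structs, iterating draws pulls far fewer cache lines than a single array of combined structs would.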


Fast & clean C++20 API for constructing resources

• Vulkan and DirectX use big structs to initialize complex resources
  o Structs contain other structs and non-owning pointer references to arrays of structs
  o Code bloat. No default values. Lifetime of temporary objects causes bugs
• Existing solutions
  o Builder pattern: Debug perf is horrible. Release codegen not optimal either
• Our solution: C++20 designated struct initializers
  o The best C99 feature finally in C++. Waited 21 years!
  o Default values:
    ▪ Provided by C++11 default member initializers in the struct
    ▪ Extremely clean syntax. Best readability
  o Array data?
    ▪ Custom span that supports initializer lists
    ▪ Safety: "const &&" parameter forces temporaries

    struct BufferDesc
    {
        // (first field illegible in the source)
        uint32 byteSize = 0;
        USAGE usage = USAGE_UNIFORM;
        MEMORY memory = MEMORY::CPU;
        f::Span<const uint8> initialData;
    };
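
To make the three ingredients concrete (default member values, a span that accepts initializer lists, and a `const &&` parameter), here is a small self-contained sketch. All names (`ByteSpan`, `Usage`, `Memory`, `BufferDesc`, `Buffer`, `createBuffer`) are stand-ins chosen for illustration, not the HypeHype API, and the "buffer" only records what it was given. It requires C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>

// Non-owning byte span that can also be built from a braced list.
struct ByteSpan {
    const uint8_t* data = nullptr;
    size_t size = 0;
    ByteSpan() = default;
    ByteSpan(const uint8_t* d, size_t s) : data(d), size(s) {}
    ByteSpan(std::initializer_list<uint8_t> list)
        : data(list.begin()), size(list.size()) {}
};

enum class Usage { Uniform, Vertex, Index };
enum class Memory { CPU, GPU_CPU };

// Every field has a default, so callers only name what they change.
struct BufferDesc {
    const char* debugName = "";
    uint32_t byteSize = 0;
    Usage usage = Usage::Uniform;
    Memory memory = Memory::CPU;
    ByteSpan initialData;
};

struct Buffer {
    uint32_t byteSize;
    Usage usage;
    Memory memory;
    uint32_t checksum; // sum of the initial bytes, just to show they were read
};

// "const &&" only binds to temporaries: the descriptor, and any braced list
// its span points at, stay alive until the end of the calling expression.
inline Buffer createBuffer(const BufferDesc&& desc) {
    uint32_t sum = 0;
    for (size_t i = 0; i < desc.initialData.size; ++i)
        sum += desc.initialData.data[i];
    return Buffer{desc.byteSize, desc.usage, desc.memory, sum};
}
```

A call such as `createBuffer({.debugName = "cube", .byteSize = 16, .usage = Usage::Vertex, .initialData = {1, 2, 3}})` compiles, while passing a named `BufferDesc` variable does not, which is the safety property the slide refers to.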


Resource construction examples

    Handle<Buffer> vertexBuffer = rm->createBuffer({
        .debugName = "cube",
        .byteSize = vertexSize * vertexAmo,
        .usage = BufferDesc::USAGE_VERTEX,
        .memory = MEMORY::GPU_CPU });

    Handle<Texture> texture = rm->createTexture({
        .debugName = "lion.png",
        .dimensions = Vector3I(256, 256, 1),
        .format = FORMAT::RGBA8_SRGB,
        .initialData = Span((uint8*)data, dataSize)
    });

    Handle<BindGroup> material = m_rm->createBindGroup({
        .debugName = "Car Paint",
        .layout = materialBindingsLayout,
        .textures = { albedo, normal, properties },
        .buffers = {{.buffer = uniforms, .byteOffset = 64}}
    });

    m_shader = rm->createShader({
        .debugName = "mesh simple",
        .VS {.byteCode = shaderVS, .entryFunc = "main"},
        .PS {.byteCode = shaderPS, .entryFunc = "main"},
        .bindGroups = {
            { m_globalsBindingsLayout },  // Globals bind group (0)
            { materialBindingsLayout },   // Material bind group (1)
        },
        .dynamicBuffers = dynamicBindings.getLayout(),
        .graphicsState = {
            .depthTest = COMPARE::GREATER_OR_EQUAL, // inverse Z
            .vertexBufferBindings = {
                {
                    // Position vertex buffer (0)
                    .byteStride = 12, .attributes = {
                        {.byteOffset = 0, .format = FORMAT::RGB32_FLOAT}
                    }
                },
                {
                    // 2nd vertex buffer: tangent, normal, color, texcoord
                    .byteStride = 24, .attributes = {
                        {.byteOffset = 0, .format = FORMAT::RGBA16_FLOAT},
                        {.byteOffset = 8, .format = FORMAT::RGBA16_FLOAT},
                        {.byteOffset = 16, .format = FORMAT::RGBA8_UNORM},
                        {.byteOffset = 20, .format = FORMAT::RG16_FLOAT}
                    }
                },
            },
            .renderPassLayout = m_renderPassLayout
        }
    });


Efficient GPU memory allocation

• Temp: high frequency
  o Must be extremely fast (million calls per frame)
  o Bump allocate 128MB memory blocks (stored in a ring)
    ▪ Backend heap object contains 1 full sized GPU buffer: Buffer = offset + heap index
  o Backend provides a concrete bump allocator object
    ▪ Allocation function bumps a pointer. Inlines to caller
    ▪ Checks offset >= 128MB → calls backend to provide the next block
    ▪ WebGL2: 32MB CPU memory blocks, glBufferSubData call per render pass
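
A temp allocator of this kind could be sketched as follows. The assumptions are mine: `TempAllocation`, `BumpAllocator` and `kRingSize` are illustrative names, alignment is a power of two, each request fits in one block, and the "backend" is reduced to advancing a ring index (a real backend would also have to wait until the GPU has finished with the block being reused).

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kBlockSize = 128u * 1024u * 1024u; // 128MB per block
constexpr uint32_t kRingSize = 3;                      // blocks in the ring (assumed)

// An allocation is just "heap index + offset" into a full sized GPU buffer.
struct TempAllocation {
    uint32_t heapIndex;
    uint32_t byteOffset;
};

class BumpAllocator {
public:
    // Fast path: align, bump, return. Small enough to inline into the caller.
    TempAllocation allocate(uint32_t byteSize, uint32_t alignment = 256) {
        uint32_t aligned = (m_offset + alignment - 1) & ~(alignment - 1);
        if (aligned + byteSize > kBlockSize) {
            nextBlock(); // slow path: ask the backend for the next block
            aligned = 0;
        }
        m_offset = aligned + byteSize;
        return TempAllocation{m_heapIndex, aligned};
    }

private:
    // Stand-in for the backend call that provides the next block in the ring.
    void nextBlock() {
        m_heapIndex = (m_heapIndex + 1) % kRingSize;
        m_offset = 0;
    }

    uint32_t m_heapIndex = 0;
    uint32_t m_offset = 0;
};
```

Nothing is ever freed individually: a whole block becomes reusable once the frames that referenced it have completed.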


• Persistent: only when needed!
  o Two-level segregated fit algorithm
    ▪ O(1) hard real time alloc/free. Uses two level bitfield + 2x lzcnt to find the bin
    ▪ Delete: Merge neighbor blocks on both sides, if they are free
  o Same allocator on Metal (placement heaps) and Vulkan!
  o I open sourced the allocator on GitHub (MIT license) [7]
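
The two level bitfield lookup can be illustrated with a simplified sketch. It is not taken from the open sourced allocator [7]: the bin layout (8 bins per power of two), the names `sizeToBin` and `BinMasks`, and the portable bit scan loops are assumptions made for this example. Production code would replace the loops with lzcnt/tzcnt style intrinsics, which is what makes the search two instructions.

```cpp
#include <cassert>
#include <cstdint>

// Portable bit scans (inputs must be non-zero).
inline uint32_t lowestSetBit(uint32_t v) {
    uint32_t i = 0;
    while (((v >> i) & 1u) == 0) ++i;
    return i;
}
inline uint32_t highestSetBit(uint32_t v) {
    uint32_t i = 31;
    while (((v >> i) & 1u) == 0) --i;
    return i;
}

constexpr uint32_t kSecondLevelBits = 3;                   // 8 bins per power of two
constexpr uint32_t kSecondLevelCount = 1u << kSecondLevelBits;
constexpr uint32_t kNoBin = 0xFFFFFFFFu;

// Size -> bin index, rounding down (used when filing a free block).
inline uint32_t sizeToBin(uint32_t size) {
    if (size < kSecondLevelCount) return size;             // tiny sizes map 1:1
    uint32_t top = highestSetBit(size);
    uint32_t second = (size >> (top - kSecondLevelBits)) & (kSecondLevelCount - 1);
    uint32_t first = top - kSecondLevelBits + 1;
    return first * kSecondLevelCount + second;
}

// One top level mask (which groups have free blocks) plus one 8 bit leaf mask
// per group (which bins inside the group have free blocks).
struct BinMasks {
    uint32_t top = 0;
    uint8_t leaf[32] = {};

    void setBin(uint32_t bin) {
        uint32_t first = bin >> kSecondLevelBits;
        leaf[first] |= static_cast<uint8_t>(1u << (bin & (kSecondLevelCount - 1)));
        top |= 1u << first;
    }
    void clearBin(uint32_t bin) {
        uint32_t first = bin >> kSecondLevelBits;
        leaf[first] &= static_cast<uint8_t>(~(1u << (bin & (kSecondLevelCount - 1))));
        if (leaf[first] == 0) top &= ~(1u << first);
    }
    // Smallest non-empty bin >= minBin: at most two bit scans, no loops over bins.
    uint32_t findAtLeast(uint32_t minBin) const {
        uint32_t first = minBin >> kSecondLevelBits;
        uint32_t second = minBin & (kSecondLevelCount - 1);
        uint32_t leafMask = leaf[first] & (0xFFu << second);
        if (leafMask == 0) {
            uint32_t topMask = (first + 1 < 32) ? (top & (0xFFFFFFFFu << (first + 1))) : 0;
            if (topMask == 0) return kNoBin;
            first = lowestSetBit(topMask);
            leafMask = leaf[first];
        }
        return first * kSecondLevelCount + lowestSetBit(leafMask);
    }
};
```

Note that an allocation has to search from a rounded-up bin so that every block in the returned bin is large enough; that rounding, the per-bin free lists and neighbor merging are omitted here.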


Bind groups, exposed to user land

• Traditional way: Separate bindings
  o Backend creates new bind groups on demand
  o Problem: Creating new groups is expensive
  o Workaround: Store bind groups in hashmap → SLOW!
• Our solution: User land bind groups
  o User constructs an immutable persistent bind group from a set of bindings
    ▪ Example: Material (5 textures + uniform buffer with value data)
  o Draw calls have three bind group slots: 0, 1, 2 (Vulkan Android min spec = 4)
    ▪ Matching the GLSL shader descriptor set slots
    ▪ Group data by bind frequency
• Abstraction: Dynamic bindings group
  o A flexible way to provide draw data. Only supports buffer bindings (with offset)
  o Vulkan/WebGPU: set 3 (dynamic offset). Metal: setBuffer + setOffset
  o Push constants? Emulated on many mobile GPUs :(

[Figure: HypeHype bind groups (not hardcoded!): slot 0 = render pass globals, slot 1 (label illegible), slot 2 = shader specific, plus dynamic draw data]


A software command buffer, but an order of magnitude faster

• Initial design: Array of draw structs
  o Only contains "metadata"
  o Super simple and fast
    ▪ 64 bytes = 1 cache line per draw
  o Actual data inside buffers (inside groups)
    ▪ Write temp data from N threads directly into GPU memory
• Let's analyze the data
  o All fields are 32 bit integers
  o Most data doesn't change between draw calls when rendering binned content
  o On average 4.5 fields change (~18 bytes)

    struct Draw // struct name illegible in the source
    {
        Handle<Shader> shader;
        Handle<BindGroup> bindGroups[3];
        Handle<DynamicBuffers> dynamicBuffers;
        Handle<Buffer> indexBuffer;
        Handle<Buffer> vertexBuffers[3];
        uint32 indexOffset = 0;
        uint32 vertexOffset = 0;
        uint32 instanceOffset = 0;
        uint32 instanceCount = 1;
        uint32 dynamicBufferOffsets[2] = {0};
        // (one more 32 bit field, illegible in the source)
    };


Draw stream: Our data interface for draw calls

• Store only the modified fields of the draw struct
  o uint32 dirty mask tells which fields have been modified
• User land: Draw stream writer class
  o Contains a draw struct (current state) + dirty mask
  o Setter for each field: if changed → set dirty bit + write field to stream
  o Draw: write dirty mask in front of the draw (stored offset)
• Backend: Stream decoding
  o For each draw: Read the dirty field bitmask
    ▪ For each set bit: Read field and emit a gfx API call
  o Advantages: No change tracking in the backend. ~3x reduced BW

[Figure: draw stream layout: Draw 0 → Draw 1 → Draw 2 → Draw 3 → ..., each draw storing only its changed fields, e.g. shader (16+16 bits), vertex buffer (16+16 bits), vertex offset (32 bits)]
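
The writer and decoder described above can be sketched in a few dozen lines. This is an illustration under simplifying assumptions, not the HypeHype code: the field set is a made-up subset, every field is a plain `uint32_t`, and the writer defers emitting changed fields until `draw()` (rather than writing in each setter and patching the mask at a stored offset) so that setters can be called in any order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical subset of the draw struct fields; one dirty bit per field.
enum Field : uint32_t {
    FIELD_SHADER = 0,
    FIELD_BIND_GROUP_1,
    FIELD_INDEX_BUFFER,
    FIELD_INDEX_OFFSET,
    FIELD_COUNT
};

constexpr uint32_t kInvalid = 0xFFFFFFFFu;

// User land: keeps the current state and remembers which fields changed.
class DrawStreamWriter {
public:
    DrawStreamWriter() {
        for (uint32_t f = 0; f < FIELD_COUNT; ++f) m_current[f] = kInvalid;
    }
    void set(Field field, uint32_t value) {
        if (m_current[field] == value) return; // unchanged: costs nothing downstream
        m_current[field] = value;
        m_dirty |= 1u << field;
    }
    // Emits the dirty mask followed by the changed fields in bit order.
    void draw() {
        m_stream.push_back(m_dirty);
        for (uint32_t f = 0; f < FIELD_COUNT; ++f)
            if (m_dirty & (1u << f)) m_stream.push_back(m_current[f]);
        m_dirty = 0;
    }
    const std::vector<uint32_t>& stream() const { return m_stream; }

private:
    uint32_t m_current[FIELD_COUNT];
    uint32_t m_dirty = 0;
    std::vector<uint32_t> m_stream;
};

struct DecodedDraw {
    uint32_t fields[FIELD_COUNT];
    uint32_t changedMask;
};

// Backend: no change tracking, just replay the deltas. A real backend would
// emit one gfx API call per set bit instead of recording the state.
inline std::vector<DecodedDraw> decode(const std::vector<uint32_t>& stream) {
    std::vector<DecodedDraw> draws;
    uint32_t state[FIELD_COUNT];
    for (uint32_t f = 0; f < FIELD_COUNT; ++f) state[f] = kInvalid;
    size_t i = 0;
    while (i < stream.size()) {
        uint32_t mask = stream[i++];
        for (uint32_t f = 0; f < FIELD_COUNT; ++f)
            if (mask & (1u << f)) state[f] = stream[i++];
        DecodedDraw d;
        for (uint32_t f = 0; f < FIELD_COUNT; ++f) d.fields[f] = state[f];
        d.changedMask = mask;
        draws.push_back(d);
    }
    return draws;
}
```

Two draws that share a shader but differ in material take five 32 bit words in this stream (mask + 2 fields, then mask + 1 field) instead of two full draw structs.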


Example: Simple draw loop

    // Per pass bindings (once)
    drawStream.setBindGroup(0, m_globalBindGroup);

    // Draw all objects
    for (const SceneObject& sceneObject : sceneObjects)
    {
        // Bump allocate uniforms in GPU memory (GPU temp bump allocator)
        DynamicBinding drawData = tmpAlloc.allocate(sizeof(DrawUniforms));

        // Write uniforms directly to GPU memory
        DrawUniforms* uniforms = (DrawUniforms*)drawData.data;
        Matrix3x4 mat = Matrix3x4::translate(sceneObject.position);
        uniforms->model = mat;
        uniforms->modelInv = Matrix3x3::inverse(mat);

        // Draw (DrawStream setters)
        drawStream.setShader(sceneObject.shader);
        drawStream.setBindGroup(1, sceneObject.material);
        drawStream.setDynamicBuffers(drawData.buffers);
        drawStream.setDynamicBufferOffset(0, drawData.byteOffset);
        drawStream.setMesh(meshes[sceneObject.meshIndex]); // Note: user land mesh
        drawStream.draw();                                 // Writes the draw dirty bitmask
    }

Thank you!

10,000 draw calls (CPU time)

    Device                                            CPU time
    iGPU: AMD RDNA2 + 6800HS 4.7GHz                   0.85ms
    7 year old Apple iPhone 6s + 1.85GHz              11.27ms
    PowerVR (model illegible) + A53 2.3GHz            20.93ms
    ARM Mali G57 MP1 + A75 (clock illegible)          15.01ms
    QC Adreno 610 + Kryo 260 2GHz                     13.69ms

• Single CPU thread
• Standard non-instanced draws
• No GPU persistent scene data
• No batching: 10,000 mesh and material changes
• >80% time spent in driver
• All devices run cool over long period of time

[Figure: iPhone 6s PBR @ 60 fps. Art by Daniel Palmi]